probability mass
Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity
Lochab, Anamika, Li, Bolian, Zhang, Ruqi
Reinforcement Learning with Verifiable Rewards (RLVR) has achieved substantial gains in single-attempt accuracy (Pass@1) on reasoning tasks, yet often suffers from reduced multi-sample coverage (Pass@K), indicating diversity collapse. We identify a structural cause for this degradation: common RLVR objectives, such as GRPO, are indifferent to how probability mass is distributed among correct solutions. Combined with stochastic training dynamics, this indifference induces a self-reinforcing collapse, in which probability mass concentrates on a narrow subset of correct outputs while alternative valid solutions are suppressed. We formalize this collapse mechanism and further characterize the optimal policy structure under two complementary criteria: robustness and entropy-regularized optimality, which identify the Uniform-Correct Policy as uniquely optimal. Motivated by this analysis, we propose Uniform-Correct Policy Optimization (UCPO), a modification to GRPO that adds a conditional uniformity penalty on the policy's distribution over correct solutions. The penalty redistributes gradient signal toward underrepresented correct responses, encouraging uniform allocation of probability mass within the correct set. Across three models (1.5B-7B parameters) and five mathematical reasoning benchmarks, UCPO improves Pass@K and diversity while maintaining competitive Pass@1, achieving up to +10\% absolute improvement on AIME24 at Pass@64 and up to 45\% higher equation-level diversity within the correct set. The code is available at https://github.com/AnamikaLochab/UCPO.
Center Smoothing: Certified Robustness for Networks with Structured Outputs Appendix
Let, y be a point in that intersection. Since, by definition, หr(x0,) is the radius of the smallest ball with 1/2 + probability mass of f(x0 + P) over all possible centers in Rk and หRis the radius of the smallest such ball centered at หf(x), we must have หr(x0,) หR. Consider the smallest ball B(z0,หr(x, 1)) that encloses at least 1/2 + 1 probability mass of f(x+ P). Since, r is the radius of the minimum enclosing ball that contains at least half of the points in Z, we have r หr(x, 1). Now, using the definition of หRand following the same reasoning as theorem 2, we can say that, d( หf(x), หf(x0)) ฮฒหr(x0,) + หR (1 + ฮฒ) หR.
Supplementary Material Proofs from Section 2
The proof of Claim 2.3 is obtained via the following calculation, using the definition of Hermite tensor (Definition 2.2). We will use i,j for indexes in [d]. The above is equivalent to Hk(Bx) = B kHk(x). We construct the truncated distribution A as follows. We first sample x A, then we reject x unless x 2 B. Let A be the distribution of the samples we get from this process. Using Markov's inequality and union bound, we have Then it only remains to verify Ex A [Hk(x)] Ex Nm[Hk(x)] 2 for any k < d.
Computational and Statistical Hardness of Calibration Distance
The distance from calibration, introduced by Bลasiok, Gopalan, Hu, and Nakkiran (STOC 2023), has recently emerged as a central measure of miscalibration for probabilistic predictors. We study the fundamental problems of computing and estimating this quantity, given either an exact description of the data distribution or only sample access to it. We give an efficient algorithm that exactly computes the calibration distance when the distribution has a uniform marginal and noiseless labels, which improves the $O(1/\sqrt{|\mathcal{X}|})$ additive approximation of Qiao and Zheng (COLT 2024) for this special case. Perhaps surprisingly, the problem becomes $\mathsf{NP}$-hard when either of the two assumptions is removed. We extend our algorithm to a polynomial-time approximation scheme for the general case. For the estimation problem, we show that $ฮ(1/ฮต^3)$ samples are sufficient and necessary for the empirical calibration distance to be upper bounded by the true distance plus $ฮต$. In contrast, a polynomial dependence on the domain size -- incurred by the learning-based baseline -- is unavoidable for two-sided estimation. Our positive results are based on simple sparsifications of both the distribution and the target predictor, which significantly reduce the search space for computation and lead to stronger concentration for the estimation problem. To prove the hardness results, we introduce new techniques for certifying lower bounds on the calibration distance -- a problem that is hard in general due to its $\textsf{co-NP}$-completeness.